Abstract

The language model is one of the most important modules in statistical
machine translation, and the word-based language model currently
dominates the field. However, many translation models (e.g. phrase-based
models) generate target language sentences by rendering and composing
phrases rather than words. It is therefore more reasonable to model the
dependencies between phrases, but little research has succeeded in
solving this problem. In this paper, we tackle it by designing a novel
phrase-based language model that addresses three key sub-problems:
(1) how to define a phrase in a language model; (2) how to determine
phrase boundaries in large-scale monolingual data in order to enlarge
the training set; and (3) how to alleviate the data sparsity caused by
the huge phrase vocabulary. With these issues carefully handled,
extensive experiments on Chinese-to-English translation show that our
phrase-based language model significantly improves translation quality,
by up to +1.47 absolute BLEU points.


Introduction 

As one of the most important modules in statistical machine translation
(SMT), the language model measures whether one translation hypothesis is
more grammatically correct than other hypotheses. Since the beginning of
the statistical era of machine translation, the word-based language
model has dominated the field. When word-based SMT was first proposed,
the model generated the target translation word by word, and we needed
to calculate how fluent the concatenation of the individual words is.
The word-based language model was thus a natural choice; see Figure 1(b)
for an illustration. However, more sophisticated translation models,
e.g. the phrase-based models (Koehn et al. 2007; Blackwood et al. 2009),
manipulate phrases rather than words during decoding. The phrases come
from natural language text, so it is less necessary to model the
dependencies among the words inside a phrase. Instead, it is much more
beneficial to model the probability distribution of a new target phrase
conditioned on the previously generated target phrases; see Figure 1(c)
for a demonstration. Obviously, the word-based language model cannot
capture the dependencies between phrases.

Figure 1: An example comparing word-based translation and phrase-based
translation. (a) shows the Chinese source sentence and the English
reference, (b) illustrates a decoding step of the word-based model
translating the third Chinese word, and (c) demonstrates a decoding step
of the phrase-based model translating the second Chinese phrase.


Although a good phrase-based language model would be very promising,
little research has succeeded in solving this problem in statistical
machine translation. In the automatic speech recognition community,
Heeman and Damnati (1997) attempt to derive a phrase-based language
model. However, their method estimates the conditional probability of
phrases by backing off to words rather than treating phrases as
inseparable units. Baisa (2011) first proposed the chunk-based language
model (including the phrase-based one) for machine translation but did
not give a solution. Recently, Xu and Chen (2015) designed a direct
algorithm for a phrase-based language model in statistical machine
translation. In their method, a phrase can be any word sequence, so the
phrase vocabulary is huge and the data sparsity problem is very serious,
which makes probability estimation for the phrase-based language model
difficult.


Why have so few researchers succeeded in proposing good solutions for
the phrase-based language model? We believe the reason lies in three
difficulties: (1) a phrase can be any word string, and it is hard to
give a good definition of phrases; (2) it is unclear how to transform
large-scale word-based monolingual data into a phrase-based corpus for
training the phrase-based language model; and (3) given the huge size of
the phrase vocabulary, it is unclear how to alleviate the data sparsity
problem.

In this paper, we aim to tackle the above three problems and propose an
effective phrase-based language model. The key idea is that we define
and obtain minimal phrases through word-aligned parallel sentences. We
then adopt a sequence labelling approach to find minimal phrase
boundaries in large-scale monolingual data. Finally, a deep neural
network (DNN) is leveraged to address the data sparsity problem of the
phrase-based language model. We make the following contributions:


• To make the granularity of phrases stable enough, we define minimal
  phrases, inspired by the concept of minimal translation units (MTU) in
  (Zhang et al. 2013). We further regard phrase boundary recognition in
  large-scale monolingual data as a sequence labelling task.

• We investigate the data sparsity problem of the phrase-based language
  model by introducing a deep neural network. We show that the DNN has
  the potential to alleviate the data sparsity problem, although it is
  not yet good enough due to the strict constraint on vocabulary size.

• Our phrase-based language model achieves substantial improvements on
  large-scale translation tasks.


Phrase-based Language Model 

For translation models that generate output by rendering and composing
target language phrases, whether the translation is obtained using the
left-to-right algorithm (Koehn et al. 2007) or the bottom-up approach
(Xiong, Liu, and Lin 2006), the phrase-based language model is proposed
to measure whether one output phrase sequence is more grammatically
correct than others.

Given a partial translation candidate in the form of a phrase sequence
\(t = t_{p0} t_{p1} \ldots t_{pn}\), the phrase language model attempts
to calculate the following probability:

\[
P(t) = P(t_{p0} t_{p1} \ldots t_{pn})
     = P(t_{p0})\, P(t_{p1} \mid t_{p0}) \cdots P(t_{pn} \mid t_{p0} \ldots t_{p(n-1)}) \tag{1}
\]

Taking the phrase-based decoding in Figure 1(c) as an example, the
translation candidate in the form of a phrase sequence is He is $ one of
$ the few supporters $ of $ this tax increase plan, in which $ denotes a
phrase boundary. The first question is whether we can simply calculate
Equation 1 by setting \(t_{p0}\) = He is, \(t_{p1}\) = one of,
\(t_{p2}\) = the few supporters, \(t_{p3}\) = of and \(t_{p4}\) = this
tax increase plan. In practice, this is not very reasonable, for three
reasons.

First, this kind of phrase lacks a well-formed definition, so the
training data for the phrase language model is hard to construct. To
understand why, let us first look at how these phrases are generated.
The target phrases used in the decoding stage all come from the phrasal
translation rules, and the phrasal translation rules are extracted from
word-aligned parallel sentences. Any word sequence pair consistent with
the word alignment becomes a phrasal translation rule. If the sentence
pair in Figure 1(a) is used as a training instance and at most three
source-side words are allowed, we can extract 17 phrasal translation
rules, as shown in Figure 2. We see that four target English phrases
contain the word tax. That is to say, there are at least four ways of
partitioning the phrases containing the word tax. Which partition should
be adopted to train the phrase-based language model? Obviously, the lack
of a well-defined phrase concept makes the construction of the training
data impossible.

Figure 2: The phrasal translation rules extracted using Figure 1(a) as a
training instance.

Second, during decoding, phrase-based SMT considers multiple overlapping
segmentations of the same sentence. Thus, one problem with the
phrase-based language model is that a unique segmentation of a sentence
into phrases is not available a priori.

Third, the phrase vocabulary is too large for the parameters to be
estimated accurately. Bilingual training data consisting of millions of
sentence pairs may yield tens or hundreds of millions of distinct target
phrases. Consequently, it is impossible to collect enough training data
for accurate parameter estimation.

To make training data construction and accurate parameter estimation
possible, we propose the concept of the minimal phrase.

Definition of Minimal Phrase 

We define the minimal phrase in the context of bilingual sentence pairs.
Informally, a minimal phrase in the target language sentence is a
contiguous word sequence that contains the minimum number of words
without violating the word alignment constraints.

Given a word-aligned parallel sentence pair \((s, t, A)\), in which
\(s = s_0 s_1 \ldots s_{m-1}\) is the source language sentence,
\(t = t_0 t_1 \ldots t_{n-1}\) is the target language sentence and
\(A \subseteq \{0..m-1\} \times \{0..n-1\}\) is the word alignment
between the source and target words, a minimal phrase is a target word
sequence \(t_i \ldots t_j\) which satisfies the following constraints:

• There exists a source word sequence \(s_k \ldots s_l\) (which may be
  empty) such that for all aligned pairs \((i', k')\), we require
  \(i \le i' \le j\) if \(k \le k' \le l\).

• There is no smaller non-empty target word sequence \(t_a \ldots t_b\)
  (\(i < a \le b \le j\) or \(i \le a \le b < j\)) which meets the first
  condition.

Figure 3: An illustration of minimal phrases in a target sentence. This
automatic word alignment is slightly different from the manually
labelled word alignment in Figure 1(a).

Feature Template      Explanation
U00:%x[-2]            second word preceding current word
U01:%x[-1]            first word preceding current word
U02:%x[0]             current word
U03:%x[1]             next word after current word
U04:%x[2]             second next word after current word
U05:%x[-2]%x[-1]      combination of left two words
U06:%x[-1]%x[0]       combination of preceding and current words
U07:%x[0]%x[1]        combination of current and next words
U08:%x[1]%x[2]        combination of next two words

Table 1: Feature templates for minimal phrase identification.


The first constraint is identical to that of phrasal translation rule
extraction (Koehn et al. 2007), except that the source word sequence can
be empty in our case. The second constraint guarantees that the phrase
is the shortest one. From the view of the phrase pair, it is similar to
the minimum translation unit (Zhang et al. 2013). However, there are two
main differences: (1) we only care about the minimal phrase on the
target language side; (2) the target phrase in a minimum translation
unit can be empty (Zhang et al. 2013), while our target minimal phrase
must contain at least one word.
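
The two constraints can be turned into a small procedure. The following Python sketch is one possible reading of the definition, not the implementation used in the experiments. It assumes the alignment is given as a set of (source index, target index) pairs, and the helper names `close_span` and `minimal_phrases` are hypothetical. Each target word is grown into the smallest span that is closed under the alignment, and a span that reaches back into earlier spans is merged with them.

```python
def close_span(i, j, alignment):
    """Grow the target span [i, j] until it is closed under the alignment."""
    while True:
        # Source positions linked to any target word inside the span.
        src = [s for (s, t) in alignment if i <= t <= j]
        if not src:
            return i, j  # unaligned words: the source sequence is empty
        k, l = min(src), max(src)
        # Target positions linked to any source word in s_k..s_l.
        tgt = [t for (s, t) in alignment if k <= s <= l]
        new_i, new_j = min(i, min(tgt)), max(j, max(tgt))
        if (new_i, new_j) == (i, j):
            return i, j
        i, j = new_i, new_j


def minimal_phrases(n, alignment):
    """Partition target positions 0..n-1 into closed spans, left to right."""
    spans = []
    pos = 0
    while pos < n:
        i, j = close_span(pos, pos, alignment)
        # If the span reaches back into earlier spans, merge and re-close.
        while spans and spans[-1][1] >= i:
            prev_i, _ = spans.pop()
            i, j = close_span(min(prev_i, i), j, alignment)
        spans.append((i, j))
        pos = j + 1
    return spans


# Toy example (assumed data): source word 1 is aligned to target words 1
# and 2, and target word 3 is unaligned.
alignment = {(0, 0), (1, 1), (1, 2)}
print(minimal_phrases(4, alignment))  # [(0, 0), (1, 2), (3, 3)]
```

In this sketch an unaligned target word forms a minimal phrase of its own, which corresponds to the case where the source word sequence is empty.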

All the identified minimal phrases in the target language sentence form
a partition of the sentence. Among all partitions that satisfy the word
alignment constraints, the set of minimal phrases is the partition with
the shortest average phrase length. Note that for each word-aligned
sentence pair, the partition into minimal phrases is unique. Moreover,
any target phrase of the phrasal translation rules can be viewed as a
composition of minimal phrases. For example, Figure 3 shows a
word-aligned sentence pair and its corresponding minimal phrases. All
five target phrases used in Figure 1(c) can be obtained by concatenating
minimal phrases: \(t_{p0} = mp_0\), \(t_{p1} = mp_1\),
\(t_{p2} = mp_2 + mp_3\), \(t_{p3} = mp_4\) and
\(t_{p4} = mp_5 + mp_6 + mp_7\). Then, Equation 1 becomes:


\[
P(t) = P(t_{p0} t_{p1} \ldots t_{pn}) = P(mp_0 mp_1 \ldots mp_n)
     = \prod_{i=0}^{n} P(mp_i \mid mp_0 \ldots mp_{i-1}) \tag{2}
\]

To compute Equation 2, two problems remain to be solved. One is to
collect sufficient training data, and the other is to design the
parameter training algorithm.

Identifying Minimal Phrases In Monolingual Data 

We know from the previous section that identifying the minimal phrases
is trivial given a word-aligned parallel sentence pair. Naturally, the
corresponding unique partition of the target sentence can be used to
estimate the language model parameters. However, bilingual resources are
always limited, so the target part of the bitext alone is too small to
train a powerful phrase-based language model. The remedy is to exploit
large-scale monolingual data.

For the traditional word-based language model, target language
monolingual data can be used directly after some preprocessing. For our
language model based on minimal phrases, however, the first problem is
to partition the monolingual sentences into sequences of minimal
phrases. Without the source language sentence and the word alignment,
identifying the minimal phrases is a difficult problem.

Consider any target language monolingual sentence
\(t_0 t_1 \ldots t_n\). Our goal is to find the best partition into
minimal phrases \(t_0 \ldots t_i,\ t_{i+1} \ldots t_k,\ \ldots,\
t_j \ldots t_n\). If we focus on each single word, the task becomes
determining its category: beginning, middle or ending of a minimal
phrase [1]. Consequently, we can formalize the problem as a sequence
labelling task. In machine learning, there are many discriminative
methods for sequence labelling. We choose the simple but effective
perceptron algorithm (Collins and Duffy 2002) for this job.
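
To make the reduction to sequence labelling concrete, the sketch below converts a minimal phrase partition into one tag per word and back. It is only an illustration: the function names are hypothetical, spans are assumed to be inclusive (start, end) index pairs, and the tag S is assumed to mark a single-word minimal phrase, following the usual word segmentation convention.

```python
def spans_to_tags(spans):
    """Turn minimal-phrase spans [(start, end), ...] into one tag per word."""
    tags = []
    for start, end in spans:
        length = end - start + 1
        if length == 1:
            tags.append("S")  # single-word minimal phrase (assumed meaning)
        else:
            tags.extend(["B"] + ["M"] * (length - 2) + ["E"])
    return tags


def tags_to_spans(tags):
    """Recover the spans from a tag sequence; E and S close a phrase."""
    spans, start = [], 0
    for pos, tag in enumerate(tags):
        if tag in ("E", "S"):
            spans.append((start, pos))
            start = pos + 1
    if start < len(tags):  # a phrase left open at the end of the sentence
        spans.append((start, len(tags) - 1))
    return spans


tags = spans_to_tags([(0, 0), (1, 2), (3, 5)])
print(tags)                 # ['S', 'B', 'E', 'B', 'M', 'E']
print(tags_to_spans(tags))  # [(0, 0), (1, 2), (3, 5)]
```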

The perceptron algorithm maps an input sentence \(t \in T\) onto an
output minimal phrase sequence \(y \in Y\), where \(T\) is the set of
target monolingual sentences and \(Y\) is the set of all possible
minimal phrase partitions. Given an input sentence \(t\), the output
\(F(t)\) is defined as the highest-scoring partition among the possible
minimal phrase partitions of \(t\):

\[
F(t) = \arg\max_{y \in \mathrm{outputs}(t)} \Phi(t, y) \cdot W \tag{3}
\]

where \(\Phi(t, y)\) is the global feature vector and \(W\) denotes the
corresponding feature weight vector. What remains is to design the
feature templates and construct a training corpus to learn the model
parameters \(W\).

For the feature templates, we assume that the category of 
the current word is determined by its surrounding words. 
Specifically, Table 1 shows all the feature templates we have 
used in our work. 
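
As an illustration of how the templates in Table 1 expand at one position, consider the sketch below. The padding token for positions outside the sentence and the "/" separator for combined features are assumptions made here, not details given in the paper.

```python
PAD = "<pad>"  # hypothetical token for positions outside the sentence


def extract_features(words, pos):
    """Instantiate templates U00..U08 of Table 1 for the word at `pos`."""
    def w(offset):
        idx = pos + offset
        return words[idx] if 0 <= idx < len(words) else PAD

    return [
        "U00:" + w(-2),                # second word preceding current word
        "U01:" + w(-1),                # first word preceding current word
        "U02:" + w(0),                 # current word
        "U03:" + w(1),                 # next word
        "U04:" + w(2),                 # second next word
        "U05:" + w(-2) + "/" + w(-1),  # left two words
        "U06:" + w(-1) + "/" + w(0),   # preceding and current words
        "U07:" + w(0) + "/" + w(1),    # current and next words
        "U08:" + w(1) + "/" + w(2),    # next two words
    ]


for feature in extract_features(["this", "tax", "increase", "plan"], 1):
    print(feature)
```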

For the training corpus, we regard the minimal phrase partitions
obtained from automatically word-aligned parallel sentences (which are
used to train the translation model in SMT) as the gold data, so that we
can learn the feature weights \(W\). During training, the averaged
perceptron is used.

We then use the trained model to label the large-scale monolingual data.
The resulting large-scale minimal phrase partitions, plus those obtained
from the parallel sentences, serve as the training data for parameter
estimation.
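
The training loop can be sketched as follows. This is a deliberately simplified averaged perceptron, given only as an illustration: it predicts each tag independently instead of searching over whole partitions as in Equation 3, it uses a reduced feature set rather than the templates of Table 1, and the class and function names as well as the toy data are hypothetical.

```python
from collections import defaultdict

TAGS = ["B", "M", "E", "S"]


def features(words, pos):
    """Reduced feature set: current, previous and next word."""
    prev_w = words[pos - 1] if pos > 0 else "<s>"
    next_w = words[pos + 1] if pos + 1 < len(words) else "</s>"
    return ["w=" + words[pos], "p=" + prev_w, "n=" + next_w]


class AveragedPerceptron:
    def __init__(self):
        self.w = defaultdict(float)      # current weights per (feature, tag)
        self.total = defaultdict(float)  # sum of the weights over all steps
        self.steps = 0

    def predict(self, feats, weights=None):
        weights = self.w if weights is None else weights
        scores = {t: sum(weights.get((f, t), 0.0) for f in feats) for t in TAGS}
        return max(TAGS, key=lambda t: scores[t])

    def train(self, data, epochs=5):
        for _ in range(epochs):
            for words, tags in data:
                for pos, gold in enumerate(tags):
                    feats = features(words, pos)
                    guess = self.predict(feats)
                    if guess != gold:  # standard perceptron update
                        for f in feats:
                            self.w[(f, gold)] += 1.0
                            self.w[(f, guess)] -= 1.0
                    # Naive averaging: add the current weights at every step.
                    self.steps += 1
                    for key, value in self.w.items():
                        self.total[key] += value
        return {key: value / self.steps for key, value in self.total.items()}


# Toy gold data (assumed): word sequences with their B/M/E/S tags.
data = [(["he", "is", "one", "of"], ["B", "E", "B", "E"]), (["plan"], ["S"])]
model = AveragedPerceptron()
averaged = model.train(data)
print([model.predict(features(["he", "is", "one", "of"], p), averaged)
       for p in range(4)])
```

Averaging the weights over all update steps is what distinguishes the averaged perceptron from the plain one; a practical implementation would use lazy updates instead of summing every weight at every step.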


[1] We follow the convention of character-based Chinese word
segmentation and define the set of categories as {B, M, E, S}.


Parameter Estimation with Deep Learning 

Given the large-scale training data of minimal phrase partitions, we are
able to train a phrase-based language model. Following previous
conventions, we adopt a Markov model of order \(N-1\) to calculate the
probability of a minimal phrase sequence:

\[
P(t) = P(mp_0 mp_1 \ldots mp_n)
     \approx \prod_{i=0}^{n} P(mp_i \mid mp_{i-N+1} \ldots mp_{i-1}) \tag{4}
\]

Standard count-based probability models, such as Kneser-Ney back-off
(Kneser and Ney 1995), are commonly used to estimate the probability of
a word given the preceding \(N-1\) words. They can also be used to
estimate the probability of a minimal phrase given the \(N-1\) previous
minimal phrases. We call this model MP-KN.
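
The factorization in Equation 4 can be illustrated with a toy count-based model over minimal phrases. Note the assumptions: the sketch uses unsmoothed relative frequencies rather than Kneser-Ney back-off, it pads the left context with a hypothetical `<s>` symbol, and the training sequences are invented. It therefore only works for n-grams seen in training.

```python
import math
from collections import Counter


def ngram_counts(sequences, order):
    """Count n-grams of minimal phrases and their (order - 1)-long contexts."""
    grams, contexts = Counter(), Counter()
    for seq in sequences:
        padded = ["<s>"] * (order - 1) + list(seq)
        for i in range(order - 1, len(padded)):
            ctx = tuple(padded[i - order + 1:i])
            grams[ctx + (padded[i],)] += 1
            contexts[ctx] += 1
    return grams, contexts


def sequence_logprob(seq, grams, contexts, order):
    """Sum of log P(mp_i | previous order - 1 minimal phrases), unsmoothed."""
    padded = ["<s>"] * (order - 1) + list(seq)
    logp = 0.0
    for i in range(order - 1, len(padded)):
        ctx = tuple(padded[i - order + 1:i])
        # Fails for unseen n-grams; a real model would back off here.
        logp += math.log(grams[ctx + (padded[i],)] / contexts[ctx])
    return logp


# Each list element is one minimal phrase, treated as an atomic unit.
train = [["He is", "one of"], ["He is", "the few"]]
grams, contexts = ngram_counts(train, order=2)
print(sequence_logprob(["He is", "one of"], grams, contexts, order=2))
```

The point of the example is that a multi-word unit such as "He is" is a single vocabulary entry, so the same counting machinery as for words applies, only over a much larger vocabulary.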

However, the MP-KN model may suffer from data sparsity, since minimal
phrases are coarser-grained than words and sequences of minimal phrases
are less likely to appear frequently. Furthermore, the MP-KN model
cannot take advantage of similar minimal phrases, as it treats them as
totally different units. Fortunately, neural network language models
(Bengio et al. 2003; Vaswani et al. 2013; Devlin et al. 2014) are able
to give a probability for any N-gram sequence and take advantage of
arbitrarily large contexts. Thus, we turn to deep neural networks to
train the phrase-based language model.


Neural Network Structure 

Our neural network is a feed-forward network, almost identical to the
one described in (Vaswani et al. 2013).

As Figure 4 shows, the input vector is the concatenation of \(N-1\)
minimal phrase context vectors, in which each minimal phrase is mapped
onto a 128-dimensional vector using a shared embedding matrix. After two
256-dimensional hidden layers with the rectified linear activation
function (\(f(x) = \max(0, x)\)) (Gutmann and Hyvarinen 2010), we apply
the softmax function in the output layer to calculate the probability of
each minimal phrase in the vocabulary:


\[
P(mp \mid x) = \frac{\exp(D_{mp}(x))}{Z(x)}, \qquad
Z(x) = \sum_{mp' \in V} \exp(D_{mp'}(x)) \tag{5}
\]

where \(x\) denotes one sample, \(D_{mp}(x)\) is the raw output layer
score of the observed minimal phrase \(mp\), and \(Z(x)\) is the
normalization term.
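
The forward computation described above can be sketched in plain Python. This is a toy illustration under stated assumptions: the class name `FeedForwardLM` is hypothetical, the weights are random rather than trained, and the dimensions are much smaller than the 128-dimensional embeddings and 256-dimensional hidden layers used in the paper.

```python
import math
import random


def relu(v):
    return [max(0.0, x) for x in v]


def affine(W, b, v):
    """Compute W v + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def softmax(scores):
    """Equation 5; subtracting the maximum only improves numerical stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class FeedForwardLM:
    def __init__(self, vocab_size, context_size, emb_dim=8, hidden_dim=16, seed=0):
        rng = random.Random(seed)
        self.E = rand_matrix(vocab_size, emb_dim, rng)  # shared embeddings
        self.W1 = rand_matrix(hidden_dim, context_size * emb_dim, rng)
        self.b1 = [0.0] * hidden_dim
        self.W2 = rand_matrix(hidden_dim, hidden_dim, rng)
        self.b2 = [0.0] * hidden_dim
        self.D = rand_matrix(vocab_size, hidden_dim, rng)  # output layer
        self.bD = [0.0] * vocab_size

    def predict(self, context_ids):
        # Concatenate the embeddings of the context minimal phrases.
        x = [value for idx in context_ids for value in self.E[idx]]
        h1 = relu(affine(self.W1, self.b1, x))
        h2 = relu(affine(self.W2, self.b2, h1))
        return softmax(affine(self.D, self.bD, h2))


lm = FeedForwardLM(vocab_size=6, context_size=4)
probs = lm.predict([0, 1, 2, 3])
print(len(probs), round(sum(probs), 6))
```

The final softmax touches every entry of the vocabulary, which is the cost that motivates the self-normalized training discussed in the next section.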

Note that the \(N-1\) minimal phrase context may be empty or incomplete.
Previous methods usually fill the context with OOV or NULL and use only
one neural network. We believe that the probability estimated in this
way would be inaccurate. Instead, we train \(N-1\) neural networks and
resort to the unigram probability if the context is empty. When
calculating the probability of a minimal phrase sequence, we choose the
neural network that corresponds to the size of the available context.



Figure 4: Neural network structure for our phrase-based language model.
From bottom to top: input vector, first hidden layer \(h_1\), second
hidden layer \(h_2\), output \(p(mp \mid context)\).


For example, if \(N = 5\) (the value used in our experiments), we build
4 networks: mp_nn5, mp_nn4, mp_nn3 and mp_nn2. Suppose we need to
compute the probability of the minimal phrase sequence
\(mp_0\, mp_1\, mp_2\, mp_3\, mp_4\). Then mp_nn5, mp_nn4, mp_nn3 and
mp_nn2 are applied to calculate \(p(mp_4 \mid mp_0 mp_1 mp_2 mp_3)\),
\(p(mp_3 \mid mp_0 mp_1 mp_2)\), \(p(mp_2 \mid mp_0 mp_1)\) and
\(p(mp_1 \mid mp_0)\) respectively.
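
The selection of a network by context size can be sketched as a simple dispatch. The function name and the interfaces of `unigram` and `networks` are assumptions made for illustration; the stand-in models below return a constant instead of a learned probability.

```python
def sequence_probability(mps, unigram, networks, order=5):
    """Multiply conditional probabilities, picking a model by context size.

    unigram:  callable mp -> probability, used when the context is empty.
    networks: dict {n: callable(context, mp) -> probability} for n = 2..order,
              where n - 1 is the number of context minimal phrases.
    """
    prob = 1.0
    for i, mp in enumerate(mps):
        context = tuple(mps[max(0, i - (order - 1)):i])
        if not context:
            prob *= unigram(mp)
        else:
            prob *= networks[len(context) + 1](context, mp)
    return prob


used = []


def make_model(n):
    def model(context, mp):
        used.append(n)  # record which network was consulted
        return 0.5      # stand-in for a learned probability
    return model


networks = {n: make_model(n) for n in range(2, 6)}
sequence = ["mp0", "mp1", "mp2", "mp3", "mp4"]
print(sequence_probability(sequence, lambda mp: 0.5, networks))
print(used)  # [2, 3, 4, 5]
```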

Neural Network Training 

Typically, the neural network is optimized using standard
back-propagation and the stochastic gradient ascent algorithm (LeCun et
al. 1998) to maximize the log-likelihood of the training data. However,
the softmax layer requires summing over all the minimal phrases in the
vocabulary for each forward computation, which is too time-consuming.
Recent years have witnessed progress in developing self-normalized
neural networks.

Vaswani et al. (2013) adopt Noise Contrastive Estimation (NCE) (Gutmann
and Hyvarinen 2010) to avoid the normalization in the output layer. In
contrast, Devlin et al. (2014) explicitly add a constraint on the
normalization term (\(\log Z(x) = 0\)) to the objective function. In our
work, we apply the NCE approach.

The main idea behind NCE is that, for each training sample \(x\), we
choose \(k\) noise samples, and the network is trained to classify
examples as training samples or noise. The objective function is the
conditional likelihood:


\[
L = \sum_{i=1}^{N} \left( \log P(C = 1 \mid x_i) + \log P(C = 0 \mid \tilde{x}_i) \right) \tag{6}
\]

Here \(C = 0\) says \(\tilde{x}_i\) is noise rather than a training
sample.
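
The class posteriors in Equation 6 are not spelled out in the text. A minimal sketch, assuming the standard NCE formulation from the literature in which a sample with model probability p and noise probability q is a training sample with probability p / (p + k q), and assuming that the noise term is summed over the k noise samples, is given below. The function names and the numbers are illustrative only.

```python
import math


def nce_posterior(p_model, p_noise, k):
    """P(C = 1 | x) under the assumed standard NCE formulation."""
    return p_model / (p_model + k * p_noise)


def nce_objective(p_model_true, q_true, noise, k):
    """Contribution of one training sample and its k noise samples.

    noise: list of (p_model, q) pairs, one per sampled noise item.
    """
    value = math.log(nce_posterior(p_model_true, q_true, k))
    for p_m, q in noise:
        value += math.log(1.0 - nce_posterior(p_m, q, k))  # log P(C = 0 | noise)
    return value


print(nce_posterior(0.2, 0.1, k=2))  # 0.5
print(nce_objective(0.2, 0.1, [(0.1, 0.1), (0.05, 0.1)], k=2))
```

Because the objective only needs the unnormalized model score of the observed sample and of the k noise samples, the sum over the whole vocabulary in the softmax is avoided during training.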


Experiments 
Translation System 

We have implemented a phrase-based translation system with a maximum
entropy based reordering model using bracketing transduction grammars
(Wu 1997; Xiong, Liu, and Lin 2006; Zhang and Zong 2013). In this
translation system, the phrasal translation rule
\(A \rightarrow (x, y)\) converts a source language phrase into a target
language phrase and forms a block. The monotone merging rule
\(A \rightarrow [A^l, A^r]\) combines two consecutive blocks into a
bigger block by concatenating the two partial target translation
candidates in order, while the swap rule
\(A \rightarrow \langle A^l, A^r \rangle\) swaps the two partial target
candidates. Obviously, the phrase-based language model can play an
important role in determining whether the resulting translation (a
sequence of phrases) is fluent or not.


Data Preparation 

The evaluation is conducted on Chinese-to-English translation. The
bilingual training data from LDC contains about 2.06 million sentence
pairs with 27.7M Chinese words and 31.9M English words. NIST MT03 is
used as the tuning data. MT05, MT06 and MT08 (news data) are used as the
test data. Case-insensitive BLEU is employed as the evaluation metric.
The statistical significance test is performed with the pairwise
re-sampling approach (Koehn 2004).

As the language model is the focus of this work, we investigate four
language models. The raw target-side data includes the English part of
the bilingual data and the monolingual Xinhua portion of the English
Gigaword. The whole target language data contains about 300 million
words. The four language models are detailed as follows:


• W-KN: the conventional word-based language model using Kneser-Ney
  count smoothing.

• W-NN: the word-based neural language model first proposed by Bengio et
  al. (2003) and successfully applied to machine translation by Vaswani
  et al. (2013) and Devlin et al. (2014). They adopt the hierarchical
  phrase-based model (Chiang 2007) as their baseline, while we employ
  the BTG-based model as our baseline. In W-NN, we keep the 160K most
  frequent words in the vocabulary.

• MP-KN: the language model using minimal phrases as basic units and
  trained with Kneser-Ney count smoothing. The raw training data is the
  same as that of W-KN. The English part of the bilingual data is
  partitioned into minimal phrases using the minimal phrase definition,
  and the Xinhua portion of Gigaword is partitioned into minimal phrases
  using the sequence labelling algorithm.

• MP-NN: the neural language model with minimal phrases serving as basic
  units, using the feed-forward neural network introduced in the
  previous section. The training data is the same as that of MP-KN, but
  we retain only the 160K most frequent minimal phrases in the
  vocabulary.


The different language models are integrated into the log-linear
translation model as additional features.
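
In a log-linear model, each language model contributes one weighted feature to the score of a translation candidate. The sketch below illustrates this with invented feature names, log-probability values and weights; in practice the weights would be tuned on the tuning set.

```python
def loglinear_score(features, weights):
    """Weighted sum of feature values, e.g. log-probabilities of sub-models."""
    return sum(weights[name] * value for name, value in features.items())


# Hypothetical feature values (log-probabilities) for one candidate.
candidate = {"translation_model": -2.0, "w_kn": -10.0, "mp_kn": -6.0}
weights = {"translation_model": 1.0, "w_kn": 0.5, "mp_kn": 0.25}
print(loglinear_score(candidate, weights))  # -8.5
```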


Experimental Results on Minimal Phrase Partition 

Before detailing the translation performance, we first report the
performance of minimal phrase partition on monolingual data. We perform
word alignment for the Chinese and English reference sentences of NIST
MT03. Based on the word alignment constraints, we obtain the
English-side minimal phrase partitions, which are employed as the gold
test data. We then apply the trained perceptron model to partition the
English reference sentences of NIST MT03 and compare the result with the
gold test data.


Method (Perplexity)     MT03     MT05     MT06     MT08
W-KN (107.39)           35.81    34.69    33.83    27.17
W-NN (130.12)           34.73    33.62    32.75    26.54
MP-KN (89.95)           34.39    33.26    32.51    25.65
MP-NN (70.55)           33.65    32.83    31.96    25.21
W-KN-NN                 36.40    35.45    34.58    27.87
W-KN+MP-KN              36.26    35.36    34.45    25.54
W-KN+MP-KN-NN           36.83    35.87    35.30    28.40
W-KN-NN+MP-KN-NN        36.95    36.13    35.56    28.92

Table 2: Translation performance (BLEU) of different language model
settings. Bold numbers denote that the model significantly outperforms
the baseline W-KN with p < 0.01.


The precision, recall and F1-score are 0.83, 0.873 and 0.851
respectively. This demonstrates that minimal phrase partition of
monolingual data performs quite well and that the minimal phrase
partitions of the large-scale monolingual data are reliable enough to be
used for training the phrase-based language model.
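
The paper does not state the unit over which precision and recall are computed. The sketch below assumes that a predicted minimal phrase counts as correct only if exactly the same span appears in the gold partition; the function names and the toy partitions are illustrative. The last line checks that the reported F1-score is consistent with the reported precision and recall.

```python
def f1_score(precision, recall):
    return 2 * precision * recall / (precision + recall)


def partition_prf(gold_spans, pred_spans):
    """Span-level precision, recall and F1 (assumed evaluation unit)."""
    gold, pred = set(gold_spans), set(pred_spans)
    correct = len(gold & pred)
    precision = correct / len(pred)
    recall = correct / len(gold)
    f1 = f1_score(precision, recall) if correct else 0.0
    return precision, recall, f1


gold = [(0, 0), (1, 2), (3, 3)]
pred = [(0, 0), (1, 1), (2, 2), (3, 3)]
print(partition_prf(gold, pred))

# Consistency check of the reported numbers: F1 of P = 0.83 and R = 0.873.
print(round(f1_score(0.83, 0.873), 3))  # 0.851
```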

Experimental Results on Translation Quality 

To gain a comprehensive understanding of how different language models
influence translation quality, we compare eight language model settings.
Table 2 reports the detailed results.

In the first four lines, we use just one language model in the
translation system. We want to answer the following two questions:
(1) can the neural language model substitute for the Kneser-Ney
count-based language model? (2) can the phrase-based language model
replace the word-based language model in SMT? Comparing W-NN with W-KN
in Table 2, we find that the word-based neural language model alone does
not perform as well as the word-based Kneser-Ney language model. A
similar phenomenon exists for the phrase-based language model (MP-KN vs.
MP-NN). In theory, a DNN can alleviate the data sparsity problem.
However, due to the strict vocabulary size constraint, it cannot
currently substitute for the Kneser-Ney language model.

Regarding the second question, we unfortunately see that the
phrase-based language model cannot outperform the word-based language
model when comparing MP-KN with W-KN. We found the reason after
analysing the decoding process. Since the minimal phrase vocabulary is
much larger than the word vocabulary (4 million vs. 1.7 million), the
n-gram hit rate of the phrase-based language model is much lower than
that of the word-based language model during decoding. Table 3 gives the
statistics on the test sets. Since we apply a 5-gram model in both the
word- and phrase-based language models, the 5-gram hit rate during
decoding is a key factor that influences translation quality. As shown
in Table 3, the hit rate of the phrase-based language model is lower
than that of the word-based language model by approximately 10
percentage points. Therefore, the phrase-based language model has to
back off to lower-order models more frequently. We also trained MP-KN
with only the minimal phrase partitions from the bilingual corpus. It
achieves 32.23, 31.09, 30.17 and 23.48 BLEU on MT03, MT05, MT06 and MT08
respectively. Obviously, excluding the large-scale monolingual data
degrades the translation quality dramatically.

Language Model     MT05       MT06       MT08
W-KN               0.2544     0.2917     0.2347
MP-KN              0.1573     0.1911     0.1443

Table 3: 5-gram hit rates for test sets during decoding.
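
The paper does not define the hit rate formally. One plausible reading, sketched below, is the share of full-order n-gram queries whose n-gram is stored in the language model; the function name and the toy data are assumptions, and the example uses bigrams instead of 5-grams to stay small.

```python
def ngram_hit_rate(sequences, known_ngrams, order=5):
    """Fraction of full-order n-gram queries that are found in the model."""
    hits = queries = 0
    for seq in sequences:
        for i in range(order, len(seq) + 1):
            queries += 1
            hits += tuple(seq[i - order:i]) in known_ngrams
    return hits / queries if queries else 0.0


# Toy example with order 2: two bigram queries, one of them is known.
print(ngram_hit_rate([["a", "b", "c"]], {("a", "b")}, order=2))  # 0.5
```

Every miss forces the model to back off to a shorter context, which is why a lower hit rate weakens the contribution of the higher-order statistics.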

We also calculate the perplexity of the four language models on the
English references of the test sets. As shown in brackets in Table 2,
perplexity has little relationship with translation quality. The
phrase-based language models have smaller perplexity; however, we
believe the values are not comparable, as the basic units are different.
We also notice that, at the phrase level, the neural language model has
a lower perplexity, whereas at the word level the neural language model
has a higher perplexity. We will study this phenomenon further in future
work.

Although the phrase-based language model cannot surpass the word-based
language model when used independently, both of them should be
indispensable in measuring the quality of the translation output. To
verify this, we incorporate multiple language models into the
translation system. The last four lines of Table 2 show the results.

Following Vaswani et al. (2013), we first test the word-based neural
language model. On top of the word-based Kneser-Ney language model, the
neural language model significantly improves the translation performance
(W-KN-NN), with the largest gain of 0.76 BLEU on MT05. This is in line
with the conclusions in (Vaswani et al. 2013).

On top of the word-based Kneser-Ney model, we integrate a Kneser-Ney
based phrase language model. The translation results (W-KN+MP-KN in
Table 2) demonstrate that the phrase-based language model is beneficial
for translation performance. It outperforms W-KN significantly on MT05
and MT06.

When we further incorporate a neural phrase language model, the
translation quality improves dramatically (W-KN+MP-KN-NN). It obtains
significant gains (more than 1.0 BLEU) over W-KN on all the test sets,
and the biggest improvement is 1.47 BLEU on MT06. Moreover, this system
even significantly outperforms the neural network augmented word-based
language model W-KN-NN. This indicates that the neural network can
exploit deep information (such as syntactic and semantic similarities)
in the phrase-based language model.

We are fortunate to see in the last line of Table 2 that the
improvements of the four language models are additive. This setting gets
the best performance on MT08, outperforming W-KN by 1.75 BLEU. It can
achieve an improvement of more than 1.0 BLEU over the strong system
W-KN-NN (on MT08). This demonstrates that our proposed phrase-based
language model is very helpful in improving translation quality.


Related Work 

In statistical machine translation, little research has succeeded in
designing a phrase-based language model.


Many researchers attempt to go beyond the word-based language model and
augment the translation system with syntax-based language models.
Charniak, Knight, and Yamada (2003) design a CFG-based syntax language
model for reranking translation output. Shen, Xu, and Weischedel (2008)
propose a dependency language model for the hierarchical phrase-based
system (Chiang 2007). Post and Gildea (2009), Xiao, Zhu, and Zhu (2011)
and Zhang, Zhai, and Zong (2013) propose tree substitution grammar based
syntax language models for the string-to-tree translation model.
However, these syntax-based language models greatly increase the
decoding time, and they are very difficult to integrate into
phrase-based translation systems, which simply generate translation
output phrase by phrase.


Baisa (2011) gives a proposal for the chunk-based language model
(including the phrase-based one) in machine translation but does not
give a solution. Recently, Xu and Chen (2015) present an approach for a
phrase-based language model in statistical machine translation. Their
approach considers any word sequence to be a phrase, which leads to a
huge phrase vocabulary and severe data sparsity. As a result, the
conditional probability between phrases is very difficult to estimate.
They report slight improvements on a small IWSLT data set. In contrast,
we give an exact and reasonable definition of the minimal phrase, and we
also propose a sequence labelling algorithm to partition large-scale
monolingual data. Furthermore, we adopt a deep neural network to better
estimate the conditional probability between minimal phrases. Finally,
we obtain significant improvements on the large-scale NIST data set.

The work most relevant to ours is the bilingual n-gram translation model
(Marino et al. 2006; Crego and Yvon 2010; Durrani, Fraser, and Schmid
2013; Zhang et al. 2013; Hu et al. 2014). Their Markov model, which
generates a translation by arranging a sequence of tuples, is very
similar to an n-gram language model. In early work, the tuple could be
any bilingual phrase pair (Marino et al. 2006; Crego and Yvon 2010).
Recently, tuples have become minimal translation units (MTU), which are
the smallest bilingual phrases satisfying the word alignment. Durrani,
Fraser, and Schmid (2013) and Zhang et al. (2013) perform translation by
composing MTUs with a Markov model. Hu et al. (2014) apply a recurrent
neural network to address the sparsity problem of MTUs.


Generally speaking, the above MTU-based model can be considered a
bilingual version of our minimal phrase based language model. However,
our method is quite different from theirs. First, we focus on language
models while they consider translation models. Second, our minimal
phrases cannot be empty, while theirs can. Third, their MTUs contain
both source- and target-side phrases, which leads to a much more serious
sparseness problem than in our model. Fourth, MTUs are inherently
bilingual and cannot make use of monolingual data. In contrast, our
phrase-based language model can take full advantage of large-scale
monolingual data.


Conclusion and Future Work 

In this paper, we have presented a novel phrase-based language model for
statistical machine translation. We first gave the definition of minimal
phrases. Then, to make full use of large-scale monolingual data for
phrase-based language model training, we developed a sequence labelling
algorithm to partition the monolingual data into minimal phrases.
Finally, we designed a deep neural network to better learn the
parameters of the phrase-based language model. Extensive experiments on
Chinese-to-English translation demonstrated that the proposed
phrase-based language model significantly improves translation
performance.

In future work, we plan to explore our phrase-based language model in
two directions. First, we will further address the sparsity problem of
minimal phrases. Second, we will incorporate our phrase-based language
model into other translation models, such as the MTU-based translation
model and the hierarchical phrase-based translation model.